---
layout: page
title: Script 5 - Probability & Uncertainty
permalink: /scripts/script5/
parent: R Scripts
nav_order: 5
---
Run the code in this entire script in your own local R script: it is crucial to run the code yourself to familiarise yourself with the language. For your own sake, I strongly recommend that you always annotate your code, including its output or result, unless it is trivial or already discussed below. Always be explicit about output and substantive results when responding to the tasks!
In this script, we will familiarise ourselves with the concepts of probability distributions, sampling distributions and the respective commands in R. We start by revising multivariate regression using Levin’s study on great power electoral interventions, before moving to probability and sampling. While this is not something you will usually be doing when analysing data, it is a crucial part of data analysis - its backend, if you will. We are borrowing the data from:
Levin, Dov H. (2016). “When the Great Power Gets a Vote: The Effects of Great Power Electoral Interventions on Election Results.” International Studies Quarterly, Vol. 60, No. 1, pp. 189-202.
Levin provides evidence for the electoral consequences of attempts by great powers to intervene in a partisan manner in another country’s elections. Based on the assumption that foreign actors face strong incentives (motive and opportunity) to intervene in competitive elections, he argues that electoral interventions systematically increase the electoral chances of the supported candidate. Overt interventions, however, tend to be more effective than covert interventions. The types of intervention range from providing funding to the preferred side’s campaign to public threats to cutting off foreign aid. Between 1946 and 2000, the United States and the USSR/Russia intervened in this manner 117 times.
The dataset Levin.dta contains the following variables:
| Variable Name | Variable Description |
|---|---|
| country | Country name |
| year | Election year |
| incumbent_vote | Vote share of the incumbent’s party (in parliamentary systems) or of the incumbent party’s presidential candidate (in presidential/semi-presidential systems) |
| elect_int | Partisan electoral intervention (USA and Russia). 1 if an intervention favours the incumbent, -1 if it favours a challenger, and 0 when no intervention occurs. |
| prev_vote | Previous vote share of the incumbent (Previous vote) |
| rgdplpcgw | Real GDP per capita growth rate (Growth) |
| com_openk | Trade as a percentage of GDP in constant terms (Trade Openness) |
| prezelection | Dummy variable measuring whether a particular election is a presidential election (1) or not (0) (Presidential Election) |
| reelection_prez | A president is running for reelection in a presidential or semi-presidential system: Yes=1, No=0 (Re-election) |
| lfrag_com | Effective number of parties or candidates contesting the election (Effective num. of Parties) |
| lgcm_rgdpl_pc | (Logged) GDP per capita (in thousand 2005 constant U.S. dollars) |
| africa | Regional dummy: Africa |
| asia | Regional dummy: Asia |
| eastcentraleurope | Regional dummy: Central and Eastern Europe |
| lamericacarribean | Regional dummy: Latin America and the Caribbean |
| coldwar | Cold War dummy |
We will start by applying to this dataset the logic of multivariate regression that we studied before. First, let’s load the data. This time, we will use a .dta file, a different data storage format from .csv (it is the format in which Stata stores its data). To read a .dta file, we need to install and load the rio package in R:
```r
#set working directory if you work in RStudio Desktop
setwd("~")
#load data
install.packages("rio")
library(rio)       #to import .dta files
library(stargazer) #for tables of regression coefficients
Levin <- import("Levin.dta")
```
Our dependent variable is incumbent_vote and the main
independent variable of interest is elect_int, which takes
the values of 0 (no electoral intervention), 1
(pro-incumbent intervention) and -1 (pro-challenger
intervention).
```r
table(Levin$elect_int)           #main independent variable
plot(as.factor(Levin$elect_int)) #plot it [why are we using as.factor?]
```
We can look at the relationship between electoral interference and vote share for the incumbent with a simple bivariate regression:
```r
#bivariate model
bivariate_model <- lm(incumbent_vote ~ elect_int, data = Levin)
summary(bivariate_model)
```
Considering the nature of the dependent and independent variable, how should we interpret the coefficients?
• What is the value of the intercept? What does it mean in substantive terms? The value is 39.89 percentage points. It represents the expected vote share for the incumbent when there is no intervention.
• What is the incumbent’s expected vote share under a pro-incumbent intervention? 42.34 percentage points.
• What is the incumbent’s expected vote share under a pro-challenger intervention? 37.45 percentage points.
```r
#how do you interpret the coefficients? You can calculate these
#values using the estimated coefficients
bivariate_model$coefficients[1]
bivariate_model$coefficients[1] + bivariate_model$coefficients[2]
bivariate_model$coefficients[1] - bivariate_model$coefficients[2]
```
Next, we start wondering whether other variables may be associated with the incumbent’s expected vote share. If a variable affects both the outcome variable and our independent variable, that variable is a confounder and needs to be added to the model to avoid bias. Can you think of any variables in the Levin dataset which may affect both the incumbent’s vote share in a given election and the probability of electoral interference by either the US or the USSR/Russia in favour of one of the candidates/parties?
Levin’s full regression model includes eleven control variables:
```r
multivariate_model <- lm(incumbent_vote ~ elect_int + prev_vote + rgdplpcgw +
                           com_openk + prezelection + reelection_prez + lfrag_com +
                           lgcm_rgdpl_pc + africa + asia + eastcentraleurope +
                           lamericacarribean, data = Levin)
stargazer(multivariate_model, type = "text")
```
How do we interpret the coefficients of interest in this multivariate model?
• The intercept represents the incumbent’s expected vote share when all independent variables are set to zero. Note that, substantively, this may or may not make sense.
• The coefficient for elect_int shows the mean effect of
different forms of intervention on the incumbent’s vote share
keeping other variables constant.
• The coefficients for continuous variables show the effect of a one-unit change in the associated variable on the expected vote share other things equal.
• The coefficients for the dummy variables indicate the magnitude of the shift in the intercept when the associated variable takes a value of 1, all other things equal.
Let’s address a question before moving on: why is there no coefficient for Western Europe and North America? It’s not that Levin forgot this region, or that there weren’t any instances of intervention there. The reason is that all the regional dummy indicators can be thought of as levels (or values) of a single region variable. As you know from previous lectures, a factor variable with k levels can be added to a regression as k − 1 binary variables. If you included all k dummy variables in the regression at the same time, R would automatically leave one out, and that one would become the reference category.
In that case, the last level that we did not include as a dummy is captured by the intercept: here, the intercept represents the incumbent’s expected vote share in a Western European/North American country when all independent variables take a value of zero.
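This behaviour is easy to see with a small simulated example (hypothetical data, not the Levin dataset): when we hand R a factor with three levels, it estimates coefficients for only two of them and absorbs the third into the intercept.

```r
# Sketch with simulated data: R drops one level of a factor
# automatically and treats it as the reference category.
set.seed(1)
region <- factor(sample(c("Africa", "Asia", "Europe"), 100, replace = TRUE))
vote   <- rnorm(100, mean = 40, sd = 5)
m <- lm(vote ~ region)
coef(m) # intercept = expected vote for "Africa" (the reference level);
        # "regionAsia" and "regionEurope" are shifts relative to it
```

By default, R takes the first level in alphabetical order as the reference; you can change it with relevel().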
It might make sense to look at an example to be clear about this.
Imagine you have a multivariate model with variables for
female (binary) and education (continuous).
Your model explaining the level of wages then is:
\[Wage = \alpha_0 + \delta_0 female + \alpha_1 education + \epsilon\]
As you can see in Figures 1 & 2, the coefficient of the dummy variable translates into a constant vertical distance between the two groups, on top of the effect of the other variables. The slope over education, indicated by \(\alpha_1\) here, remains the same for both groups.
Having reviewed the fundamental tools of multivariate regression, we
now move to evaluating the concepts of probability and uncertainty using
the Levin data.
• In the lecture, you learn that all the variables we study in political science (or economics, for that matter) are actually random variables because each value they take is a result of a probabilistic process.
• The probability that a random variable takes a given value generates its probability distribution.
The values of random variables must represent mutually exclusive and exhaustive events. For example, in a single coin toss you get either heads or tails, never both at once. Random variables can be categorical or continuous; this refers to the measurement scale of these variables, as discussed in Script 2.
In statistics, there are several “standard” probability distributions of random variables. Binary or discrete random variables may follow, for instance, Bernoulli or binomial probability distributions. Continuous variables may follow uniform or normal distributions. The normal distribution (also called the Gaussian distribution) is bell-shaped and symmetric: data cluster around the mean value and become rarer and rarer the further we move away from the mean.
The normal distribution is defined by two parameters: the mean (\(μ\)) and the standard deviation (\(σ\)). Knowing these two parameters, we can draw any normal distribution. For example, here is the normal distribution of a random variable with a mean of 0 and standard deviation of 1:
You can think of the probability curve (or probability density function) as a histogram made up of bins with infinitesimally small width.
The normal distribution is so important because many natural and social science phenomena follow it. Examples of real-world variables that follow the normal distribution include height, blood pressure, body temperature, and IQ scores. It is also easy to define: just by knowing the mean and standard deviation, we can know what proportion of the data are located within a given interval:
• Around 68.2% of the data are located within one standard deviation of the mean (i.e., in the (\(μ − σ\), \(μ + σ\)) interval).
• About 95.4% of the data are located within two standard deviations of the mean (i.e., in the (\(μ − 2σ\), \(μ + 2σ\)) interval).
• About 99.7% of the data are located within three standard deviations of the mean (i.e., in the (\(μ − 3σ\), \(μ + 3σ\)) interval).
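We can verify these three proportions directly with pnorm(), which returns the cumulative probability of a normal distribution up to a given value:

```r
# The 68-95-99.7 rule, checked against the standard normal CDF
pnorm(1) - pnorm(-1) # ~0.683: within one standard deviation of the mean
pnorm(2) - pnorm(-2) # ~0.954: within two standard deviations
pnorm(3) - pnorm(-3) # ~0.997: within three standard deviations
```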
Finally, the normal distribution with mean of 0 and standard deviation equal to 1 (like the one shown in the Figure) represents a special case called the standard normal distribution. Every other normal distribution can be ‘standardized’ (i.e. reduced to the standard form) by means of a simple formula called the z-score:
\[z = \frac{x_i - \overline{x}}{s_x}\]
Using the z-score, we can find the standard normal distribution’s equivalent of every value in any other normal distribution. Note that we only need to know a value (\(x_i\)), the mean (\(\overline{x}\)) and the standard deviation (\(s_x\)). You will see that this will become extremely helpful in the next script when we’ll discuss hypothesis testing.
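As a quick sketch with made-up numbers (not the Levin data), we can apply the z-score formula by hand and compare the result with R’s built-in scale() function:

```r
# Standardising a small vector of hypothetical vote shares
x <- c(35, 40, 45, 50, 55)
z <- (x - mean(x)) / sd(x) # the z-score formula applied element-wise
z
scale(x)[, 1]              # scale() gives the same result
mean(z); sd(z)             # a standardised variable has mean 0 and sd 1
```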
The same probabilistic reasoning also applies to the statistical measures we have studied so far, including the mean and standard deviations. In fact, these are also random variables because their values will vary depending on the sample we use. Suppose that we draw all possible samples of size \(n\) from a given population. Suppose further that we compute a statistic (e.g. a mean, proportion, standard deviation) for each sample. The probability distribution of this statistic is called a sampling distribution (e.g. sampling distribution of the mean). And the standard deviation of this statistic is called the standard error.
Note: the standard deviation measures the amount of variability, or dispersion, of a set of data around its mean, whereas the standard error of the mean measures how far the sample mean typically is from the true population mean.
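Analytically, the standard error of the sample mean is \(s_x / \sqrt{n}\). A minimal sketch with simulated data (not the Levin dataset):

```r
# Standard error of the mean: sd(x) / sqrt(n)
set.seed(42)
x  <- rnorm(100, mean = 40, sd = 8) # one simulated sample of size 100
se <- sd(x) / sqrt(length(x))       # estimated standard error of the mean
se                                  # roughly 8 / sqrt(100) = 0.8
```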
The variability of a sampling distribution is measured by its standard error. The variability of a sampling distribution depends on three factors:
• N: The number of observations in the population.
• n: The number of observations in the sample.
• The way the random sample is selected (with or without replacement).
If the population size is much larger than the sample size, then the sampling distribution has roughly the same standard error, whether we sample with or without replacement. On the other hand, if the sample represents a significant fraction (say, 1/20) of the population size, the standard error will be meaningfully smaller when we sample without replacement.
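We can check this claim with a small simulation on a hypothetical population of 1,000 values, drawing a sample that is a fifth of the population size with and without replacement:

```r
# When the sample is a large fraction of the population, sampling
# without replacement yields a smaller standard error.
set.seed(1)
pop <- rnorm(1000, mean = 40, sd = 10)
with_repl    <- replicate(5000, mean(sample(pop, 200, replace = TRUE)))
without_repl <- replicate(5000, mean(sample(pop, 200, replace = FALSE)))
sd(with_repl)    # standard error with replacement, ~10 / sqrt(200)
sd(without_repl) # noticeably smaller: the finite population correction
```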
The central limit theorem states that the sampling distribution of the mean of any random variable will be normal or nearly normal, if the sample size is large enough.
How large is “large enough”? The answer depends on two factors:
• Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
• The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required.
In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.
Let’s see whether that is true using the Levin data. We have 851 observations in our dataset. This is certainly not the universe of cases, but suppose for now that it is. We can draw a number of random samples from that “population” and take the mean of every sample.
First, we look at the distribution of the dependent
variable, incumbent_vote, using a histogram and a density
line.
```r
## Inspection of the data
# Have a look at the dependent variable and its distribution
summary(Levin$incumbent_vote) #summarise our variable

#plot a histogram of the DV
hist(Levin$incumbent_vote,
     freq = FALSE,
     breaks = seq(from = 0, to = 80, by = 4), xlim = c(0, 80),
     main = "Distribution of Vote Share", #change main title label
     xlab = "Incumbent Vote Share")       #change x-axis title

# Plot a density line
lines(density(Levin$incumbent_vote), lty = "dotted", col = "blue", lwd = 2)
```
As we can see, the distribution of vote shares for the incumbent is roughly bell-shaped, although it is not perfectly symmetric and is slightly skewed.
We can then use the mean and the standard deviation of the variable
to generate a normal distribution: i.e. what would
incumbent_vote look like if it were normally
distributed? We use the dnorm function to achieve
this:
```r
# Add a normal distribution for incumbent_vote
xfit <- seq(min(Levin$incumbent_vote), max(Levin$incumbent_vote), by = 0.1)
#we take a sequence between the minimum and maximum value
#of the variable in increments of 0.1 points
yfit <- dnorm(xfit, mean = mean(Levin$incumbent_vote), sd = sd(Levin$incumbent_vote))
# dnorm computes the normal density based on the parameters we enter

#here we want to add the line for the normal distribution
#to the histogram above
hist(Levin$incumbent_vote,
     freq = FALSE, breaks = seq(from = 0, to = 80, by = 4),
     xlim = c(0, 80), main = "Distribution of Vote Share",
     xlab = "Incumbent Vote Share")
# Plot a density line
lines(density(Levin$incumbent_vote), lty = "dotted", col = "blue", lwd = 2)
# Add the normal curve
lines(xfit, yfit, col = "red", lwd = 2)
```
We can see that the raw data resemble a normal distribution quite closely, although they do not map onto it perfectly. Note that, strictly speaking, the central limit theorem concerns the distribution of sample means rather than the raw variable; here we simply observe that, with 851 observations, the variable itself approximates a normal distribution.
Now, let’s see if we can get to the mean value of
incumbent_vote with random sampling. We first pretend that
we don’t have access to the full dataset of 851 observations, but we can
only pick 50 observations at random. This way, we are in the same
position as a survey researcher who doesn’t have the resources to access
the full population they are studying.
R can help us generate a sample of random numbers from a dataset.
• First, we need to “set a seed”, i.e., give R a
number from which it will generate random values using an algorithm. The
set.seed() function sets the starting number used to
generate a sequence of random numbers - this ensures that you get the
same result if you start with that same seed each time you run the same
process. We will use the number 3 as our seed. Note that the seed
sets a sequence of random numbers, not the same random number for all
operations.
• Then, we specify that we want to sample 50 observations at random
from the Levin dataset without replacement (because 50 out
of 851 is quite a significant proportion).
```r
# Drawing a random sample
set.seed(3) #so that R's random number generator returns the same result every time
sample1 <- Levin[sample(nrow(Levin), 50, replace = FALSE), ] #draw a sample of 50 rows

#compare the mean from the population and the mean from the sample
mean(sample1$incumbent_vote)
mean(Levin$incumbent_vote)
```
Taking the mean of the sample yields a value of 36.84. This is pretty close to the ‘true’ mean of the variable (39.95). However, we might have been lucky: if we drew another random sample, maybe we wouldn’t get such a close value. To be sure, we need to draw many random samples.
R can help us do this by replicating or “looping” the sample command using the replicate function. We can then summarise and visualise the results from our sampling.
```r
set.seed(3)
#150 samples of 50 observations each from the Levin data;
#we calculate the mean of each sample
Lev_samples <- replicate(150, mean(sample(Levin$incumbent_vote, 50, replace = FALSE)))
summary(Lev_samples)

# Visualise the result using a histogram
hist(Lev_samples, freq = FALSE,
     breaks = seq(from = 30, to = 50, by = 1), xlim = c(30, 50), ylim = c(0, 0.20),
     main = "Sampling Distribution of Vote Share - 150 Samples",
     xlab = "Mean Incumbent Vote Share")
# Density line of the sample means
lines(density(Lev_samples), lty = "dotted", col = "blue", lwd = 2)
# Original distribution
lines(density(Levin$incumbent_vote), lty = "dotted", col = "green", lwd = 2)
# Normal distribution
xfit1 <- seq(min(Lev_samples), max(Lev_samples), by = 0.1)
yfit1 <- dnorm(xfit1, mean = mean(Lev_samples), sd = sd(Lev_samples))
lines(xfit1, yfit1, col = "red", lwd = 2)
```
We see that, although the 150 random samples give us different means, these are all clustered between 35 and 45, and their distribution closely resembles a normal curve. The average of all these sample means is 39.82, which is extremely close to the target value of 39.95. Thus, had we not had access to the true value, we could have estimated it pretty accurately by drawing random samples and taking the average of the sample means. This illustrates a series of points:
• If the sample size is large enough (rule of thumb: >30), the distribution of sample means (what is called the sampling distribution) is approximately normal. That is, the central limit theorem applies to samples as well.
• The sampling distribution is centered around the population mean.
• We can use random samples to estimate a population’s parameter (mean or variance).
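Assuming only a simulated stand-in population (so the sketch runs without the Levin.dta file), these points can be checked directly: the sampling distribution should centre on the population mean, and its standard deviation should approximate the analytical standard error \(s/\sqrt{n}\):

```r
# Hypothetical "population" standing in for the Levin data
set.seed(3)
pop <- rnorm(851, mean = 40, sd = 10)
sample_means <- replicate(500, mean(sample(pop, 50)))
mean(sample_means) # close to the population mean of pop
sd(sample_means)   # close to the analytical standard error:
sd(pop) / sqrt(50) # sd(pop) / sqrt(n)
```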
Does this also work when the underlying distribution of a variable is skewed? See Extension 1 for an application to find out!
Why do we care? We know that without further
information (explanatory factors), our best guess of the dependent
variable incumbent_vote is the sample mean, which should be
an unbiased estimate of \(\mu\), the
population mean (universe of cases). We could equally calculate any
other statistic (e.g. median, proportion, standard deviation) for that
sample. We also know that the sample mean will certainly not be a
perfect estimate since we are using a sample and not the entire
population. Therefore, if we took another sample we would expect the
mean in this sample to be a little different; by chance, the mean we
estimate is bound to be at least either a little bit higher or a little
bit lower than the population mean. Thus, we need to know how
accurate our estimate is – and accounting for accuracy is based on the
very idea of random sampling and its characteristics. This is the idea
behind measures of uncertainty and confidence intervals, which are at
the core of hypothesis testing. More on this soon!
Does the mean of the sampling distribution always approximate the population mean? See Extension 2 for a more extensive application!
Now, return to the central limit theorem. Remember the idea of (random) sampling: we draw a number of random samples from the population and take the mean of every sample, before comparing the actual distribution to the normal distribution. We did that for the variable incumbent_vote, and found that the original distribution was almost normal. Now, we want to know whether this would also work when the underlying distribution of a variable is skewed.
• Have a look at the distribution of the variable com_openk, using the summary() and hist() commands; then add a density line to your plot to visualise the original distribution of the variable.
• Now, add a normal distribution for trade openness using the same
syntax as earlier for incumbent_vote, using the
seq() and dnorm() commands.
• Now, randomly draw 50 com_openk values, take the mean,
and repeat 150 times (before doing any random sampling, set a seed of 3
to make sure everyone gets the same results). Draw the distribution of
this variable using the same sequence of commands as previously, and fit
a normal distribution. What do you notice?
```r
#example of a heavily skewed variable
summary(Levin$com_openk)
hist(Levin$com_openk, freq = FALSE)
lines(density(Levin$com_openk), lty = "dotted", col = "blue", lwd = 2)
xfit1 <- seq(min(Levin$com_openk), max(Levin$com_openk), by = 0.1)
yfit1 <- dnorm(xfit1, mean = mean(Levin$com_openk), sd = sd(Levin$com_openk))
lines(xfit1, yfit1, col = "red", lwd = 2)
#you can clearly see the distribution is truncated from the left

#let's sample from the "population"
set.seed(3)
Lev_samples1 <- replicate(150, mean(sample(Levin$com_openk, 50, replace = FALSE)))
summary(Lev_samples1) #the average of sample means is close to the true mean!
hist(Lev_samples1, freq = FALSE)
lines(density(Lev_samples1), lty = "dotted", col = "blue", lwd = 2)
lines(density(Levin$com_openk), lty = "dotted", col = "green", lwd = 2)
xfit2 <- seq(min(Lev_samples1), max(Lev_samples1), by = 0.1)
yfit2 <- dnorm(xfit2, mean = mean(Lev_samples1), sd = sd(Lev_samples1))
lines(xfit2, yfit2, col = "red", lwd = 2)
#the histogram of sample means resembles a normally distributed variable!
```
Now we are interested in knowing how accurate our estimate of the mean is, given that we compute this statistic from a sample, not from the entire population. Does the mean of the sampling distribution approach the population mean? Let us have a look at the incumbent_vote variable.
• First, we create an empty matrix to store the means.
```r
means <- matrix(NA, nrow = 500, ncol = 3) #500 rows, one column per exercise
```
• Then draw 50 incumbent_vote values and take their
mean. Do it 50 times. We can use a loop to simplify this process.
```r
set.seed(53)      #we set the seed
for (i in 1:50) { #we tell R to repeat this 50 times
  means[i, 1] <- mean(sample(Levin$incumbent_vote, 50))
  #save the result in the first column of the matrix we created
}
#this fills the first 50 rows of the matrix's first column
```
Let’s plot this now!
```r
hist(means[, 1], breaks = seq(33, 46, 0.25)) #histogram for the first resampling exercise
abline(v = mean(Levin$incumbent_vote, na.rm = TRUE), col = "red", lwd = 2)
#vertical line at the population mean of incumbent_vote
abline(v = mean(means[, 1], na.rm = TRUE), col = "blue", lwd = 2)
#vertical line at the mean of the first resampling exercise
```
What does this look like?
Let’s continue with the following:
• Draw 50 incumbent_vote values and take their mean. Do this 500 times. Hint: use the same syntax as above.
• Draw 500 incumbent_vote values and take their mean. Do this 500 times. Hint: use the same syntax as above.
• Now, we draw a histogram for each time we resample, and compare the mean from the new sample to the original mean.
What do you notice?
```r
set.seed(53)
for (i in 1:500) {
  means[i, 2] <- mean(sample(Levin$incumbent_vote, 50))
}
set.seed(53)
for (i in 1:500) {
  means[i, 3] <- mean(sample(Levin$incumbent_vote, 500))
}

par(mfrow = c(1, 3)) #display the three histograms side by side

hist(means[, 1], breaks = seq(33, 46, 0.25)) #histogram for the first resampling exercise
abline(v = mean(Levin$incumbent_vote, na.rm = TRUE), col = "red", lwd = 2)
#vertical line at the population mean of incumbent_vote
abline(v = mean(means[, 1], na.rm = TRUE), col = "blue", lwd = 2)
#vertical line at the mean of the first resampling exercise

#same for exercises 2 and 3
hist(means[, 2], breaks = seq(33, 47, 0.25))
abline(v = mean(Levin$incumbent_vote, na.rm = TRUE), col = "red", lwd = 2)
abline(v = mean(means[, 2], na.rm = TRUE), col = "blue", lwd = 2)

hist(means[, 3], breaks = seq(38, 42, 0.1))
abline(v = mean(Levin$incumbent_vote, na.rm = TRUE), col = "red", lwd = 2)
abline(v = mean(means[, 3], na.rm = TRUE), col = "blue", lwd = 2)

par(mfrow = c(1, 1)) #restore the default single-plot layout
```
To verify our probabilistic intuitions, we are going to employ
random simulations, which combine the sample()
command with for loops (for more on loops, see: Imai, page
124).
A loop is a programming construct that allows us to repeatedly execute similar code chunks. The syntax in R is as follows:
• for (i in X) will create a loop, where the object
\(i\) (you can call this whatever you
like) takes all the values in the vector \(X\), iterating through each. So, when you
call for (i in 1:10), \(i\) will take the value of 1 at the first
iteration, 2 at the second iteration, and so on up to 10. In other
words, you apply the same code 10 times, just replacing the value
indicated by \(i\) with a sequence of
the numbers from 1 to 10.
• within the braces {...} you pass the code chunk you
want to iterate.
For instance,

```r
for (i in 1:10) {
  print(i^2)
}
```

will print the square of 1 in the first iteration of the loop, the square of 2 in the second iteration, and so on up to 10. When you run for loops, make sure to run all the code from the for expression to the closing brace } at once.
Sometimes, we use loops to “fill up” an empty vector by indexing
\(i\), so that the vector will take in
position \(i\) the output of our code
in the \(i^{th}\) iteration of the
loop. For instance, the following code creates an empty vector of size
10, and then through a loop it adds in each position (1,2, … up to 10)
of the vector the result of the sample() function, which
selects a random number between 1 and 4, iterated 10 times.
```r
output <- rep(NA, 10) #now the vector is empty
for (i in 1:10) {
  output[i] <- sample(c(1, 2, 3, 4), size = 1)
}
output #now the vector is full
```
If you want, you can also set \(i\) as the seed within the loop, so you will always get the same randomly generated values at every iteration.
```r
output <- rep(NA, 10) #now the vector is empty
for (i in 1:10) {
  set.seed(i)
  output[i] <- sample(c(1, 2, 3, 4), size = 1)
}
output #now the vector is full
```